{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.3 Summarizing and Computing Descriptive Statistics（汇总和描述性统计）\n",
    "\n",
    "pandas有很多数学和统计方法。大部分可以归类为降维或汇总统计，这些方法是用来从series中提取单个值（比如sum或mean）。还有一些方法来处理缺失值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1.40</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>7.10</td>\n",
       "      <td>-4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>0.75</td>\n",
       "      <td>-1.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    one  two\n",
       "a  1.40  NaN\n",
       "b  7.10 -4.5\n",
       "c   NaN  NaN\n",
       "d  0.75 -1.3"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],\n",
    "                   [np.nan, np.nan], [0.75, -1.3]],\n",
    "                  index=['a', 'b', 'c', 'd'],\n",
    "                  columns=['one', 'two'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用sum的话，会返回一个series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "one    9.25\n",
       "two   -5.80\n",
       "dtype: float64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用`axis='columns'` or `axis=1`，计算列之间的和："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    1.40\n",
       "b    2.60\n",
       "c    0.00\n",
       "d   -0.55\n",
       "dtype: float64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sum(axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算的时候，NA（即缺失值）会被除外，除非整个切片全是NA。我们可以用skipna来跳过计算NA："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a      NaN\n",
       "b    1.300\n",
       "c      NaN\n",
       "d   -0.275\n",
       "dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.mean(axis='columns', skipna=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一些reduction方法：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/n95fy.png)\n",
    "\n",
    "一些方法，比如idxmin和idxmax，能返回间接的统计值，比如index value："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1.40</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>7.10</td>\n",
       "      <td>-4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>0.75</td>\n",
       "      <td>-1.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    one  two\n",
       "a  1.40  NaN\n",
       "b  7.10 -4.5\n",
       "c   NaN  NaN\n",
       "d  0.75 -1.3"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "one    b\n",
       "two    d\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.idxmax()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还能计算累加值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1.40</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>8.50</td>\n",
       "      <td>-4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>9.25</td>\n",
       "      <td>-5.8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    one  two\n",
       "a  1.40  NaN\n",
       "b  8.50 -4.5\n",
       "c   NaN  NaN\n",
       "d  9.25 -5.8"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.cumsum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另一种类型既不是降维，也不是累加。describe能一下子产生多维汇总数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/numpy/lib/function_base.py:4116: RuntimeWarning: Invalid value encountered in percentile\n",
      "  interpolation=interpolation)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>3.083333</td>\n",
       "      <td>-2.900000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>3.493685</td>\n",
       "      <td>2.262742</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.750000</td>\n",
       "      <td>-4.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>7.100000</td>\n",
       "      <td>-1.300000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            one       two\n",
       "count  3.000000  2.000000\n",
       "mean   3.083333 -2.900000\n",
       "std    3.493685  2.262742\n",
       "min    0.750000 -4.500000\n",
       "25%         NaN       NaN\n",
       "50%         NaN       NaN\n",
       "75%         NaN       NaN\n",
       "max    7.100000 -1.300000"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于非数值性的数据，describe能产生另一种汇总统计："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     a\n",
       "1     a\n",
       "2     b\n",
       "3     c\n",
       "4     a\n",
       "5     a\n",
       "6     b\n",
       "7     c\n",
       "8     a\n",
       "9     a\n",
       "10    b\n",
       "11    c\n",
       "12    a\n",
       "13    a\n",
       "14    b\n",
       "15    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj = pd.Series(['a', 'a', 'b', 'c'] * 4)\n",
    "obj"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     16\n",
       "unique     3\n",
       "top        a\n",
       "freq       8\n",
       "dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下表是一些描述和汇总统计数据：\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/wygi1.png)\n",
    "\n",
    "# 1 Correlation and Covariance (相关性和协方差)\n",
    "\n",
    "假设DataFrame时股价和股票数量。这些数据取自yahoo finace，用padas-datareader包能加载。如果没有的话，用conda或pip来下载这个包："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "conda install pandas-datareader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas_datareader.data as web"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all_data = {ticker: web.get_data_yahoo(ticker)\n",
    "            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}\n",
    "\n",
    "price = pd.DataFrame({ticker: data['Adj Close']\n",
    "                      for ticker, data in all_data.items()})\n",
    "\n",
    "volumn = pd.DataFrame({ticker: data['Volumn']\n",
    "                       for ticker, data in all_data.items()})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的代码无法直接从yahoo上爬取数据，因为yahoo被verizon收购后，好像是不能用了。于是这里我们直接从下好的数据包里加载。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.1 Introduction to pandas Data Structures（pandas的数据结构）.ipynb\r\n",
      "5.2 Essential Functionality（主要功能）.ipynb\r\n",
      "5.3 Summarizing and Computing Descriptive Statistics（总结和描述性统计）.ipynb\r\n"
     ]
    }
   ],
   "source": [
    "ls ../examples/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "price = pd.read_pickle('../examples/yahoo_price.pkl')\n",
    "volume = pd.read_pickle('../examples/yahoo_volume.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>GOOG</th>\n",
       "      <th>IBM</th>\n",
       "      <th>MSFT</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2010-01-04</th>\n",
       "      <td>27.990226</td>\n",
       "      <td>313.062468</td>\n",
       "      <td>113.304536</td>\n",
       "      <td>25.884104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-05</th>\n",
       "      <td>28.038618</td>\n",
       "      <td>311.683844</td>\n",
       "      <td>111.935822</td>\n",
       "      <td>25.892466</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-06</th>\n",
       "      <td>27.592626</td>\n",
       "      <td>303.826685</td>\n",
       "      <td>111.208683</td>\n",
       "      <td>25.733566</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-07</th>\n",
       "      <td>27.541619</td>\n",
       "      <td>296.753749</td>\n",
       "      <td>110.823732</td>\n",
       "      <td>25.465944</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-08</th>\n",
       "      <td>27.724725</td>\n",
       "      <td>300.709808</td>\n",
       "      <td>111.935822</td>\n",
       "      <td>25.641571</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 AAPL        GOOG         IBM       MSFT\n",
       "Date                                                    \n",
       "2010-01-04  27.990226  313.062468  113.304536  25.884104\n",
       "2010-01-05  28.038618  311.683844  111.935822  25.892466\n",
       "2010-01-06  27.592626  303.826685  111.208683  25.733566\n",
       "2010-01-07  27.541619  296.753749  110.823732  25.465944\n",
       "2010-01-08  27.724725  300.709808  111.935822  25.641571"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "price.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>GOOG</th>\n",
       "      <th>IBM</th>\n",
       "      <th>MSFT</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2010-01-04</th>\n",
       "      <td>123432400</td>\n",
       "      <td>3927000</td>\n",
       "      <td>6155300</td>\n",
       "      <td>38409100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-05</th>\n",
       "      <td>150476200</td>\n",
       "      <td>6031900</td>\n",
       "      <td>6841400</td>\n",
       "      <td>49749600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-06</th>\n",
       "      <td>138040000</td>\n",
       "      <td>7987100</td>\n",
       "      <td>5605300</td>\n",
       "      <td>58182400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-07</th>\n",
       "      <td>119282800</td>\n",
       "      <td>12876600</td>\n",
       "      <td>5840600</td>\n",
       "      <td>50559700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-08</th>\n",
       "      <td>111902700</td>\n",
       "      <td>9483900</td>\n",
       "      <td>4197200</td>\n",
       "      <td>51197400</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 AAPL      GOOG      IBM      MSFT\n",
       "Date                                              \n",
       "2010-01-04  123432400   3927000  6155300  38409100\n",
       "2010-01-05  150476200   6031900  6841400  49749600\n",
       "2010-01-06  138040000   7987100  5605300  58182400\n",
       "2010-01-07  119282800  12876600  5840600  50559700\n",
       "2010-01-08  111902700   9483900  4197200  51197400"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "volume.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> pct_change(): 这个函数用来计算同colnums两个相邻的数字之间的变化率\n",
    "\n",
    "现在我们计算一下价格百分比的变化："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>GOOG</th>\n",
       "      <th>IBM</th>\n",
       "      <th>MSFT</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2016-10-17</th>\n",
       "      <td>-0.000680</td>\n",
       "      <td>0.001837</td>\n",
       "      <td>0.002072</td>\n",
       "      <td>-0.003483</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-10-18</th>\n",
       "      <td>-0.000681</td>\n",
       "      <td>0.019616</td>\n",
       "      <td>-0.026168</td>\n",
       "      <td>0.007690</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-10-19</th>\n",
       "      <td>-0.002979</td>\n",
       "      <td>0.007846</td>\n",
       "      <td>0.003583</td>\n",
       "      <td>-0.002255</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-10-20</th>\n",
       "      <td>-0.000512</td>\n",
       "      <td>-0.005652</td>\n",
       "      <td>0.001719</td>\n",
       "      <td>-0.004867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-10-21</th>\n",
       "      <td>-0.003930</td>\n",
       "      <td>0.003011</td>\n",
       "      <td>-0.012474</td>\n",
       "      <td>0.042096</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                AAPL      GOOG       IBM      MSFT\n",
       "Date                                              \n",
       "2016-10-17 -0.000680  0.001837  0.002072 -0.003483\n",
       "2016-10-18 -0.000681  0.019616 -0.026168  0.007690\n",
       "2016-10-19 -0.002979  0.007846  0.003583 -0.002255\n",
       "2016-10-20 -0.000512 -0.005652  0.001719 -0.004867\n",
       "2016-10-21 -0.003930  0.003011 -0.012474  0.042096"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns = price.pct_change()\n",
    "returns.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "series的corr方法计算两个，重合的，非NA的，通过index排列好的series。cov计算方差："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4997636114415116"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns['MSFT'].corr(returns['IBM'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8.8706554797035489e-05"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns['MSFT'].cov(returns['IBM'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为MSFT是一个有效的python属性，我们可以通过更简洁的方式来选中columns："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4997636114415116"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns.MSFT.corr(returns.IBM)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "dataframe的corr和cov方法，能返回一个完整的相似性或方差矩阵："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>GOOG</th>\n",
       "      <th>IBM</th>\n",
       "      <th>MSFT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AAPL</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.407919</td>\n",
       "      <td>0.386817</td>\n",
       "      <td>0.389695</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <td>0.407919</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.405099</td>\n",
       "      <td>0.465919</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>0.386817</td>\n",
       "      <td>0.405099</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.499764</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MSFT</th>\n",
       "      <td>0.389695</td>\n",
       "      <td>0.465919</td>\n",
       "      <td>0.499764</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          AAPL      GOOG       IBM      MSFT\n",
       "AAPL  1.000000  0.407919  0.386817  0.389695\n",
       "GOOG  0.407919  1.000000  0.405099  0.465919\n",
       "IBM   0.386817  0.405099  1.000000  0.499764\n",
       "MSFT  0.389695  0.465919  0.499764  1.000000"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns.corr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>GOOG</th>\n",
       "      <th>IBM</th>\n",
       "      <th>MSFT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AAPL</th>\n",
       "      <td>0.000277</td>\n",
       "      <td>0.000107</td>\n",
       "      <td>0.000078</td>\n",
       "      <td>0.000095</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <td>0.000107</td>\n",
       "      <td>0.000251</td>\n",
       "      <td>0.000078</td>\n",
       "      <td>0.000108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>0.000078</td>\n",
       "      <td>0.000078</td>\n",
       "      <td>0.000146</td>\n",
       "      <td>0.000089</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MSFT</th>\n",
       "      <td>0.000095</td>\n",
       "      <td>0.000108</td>\n",
       "      <td>0.000089</td>\n",
       "      <td>0.000215</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          AAPL      GOOG       IBM      MSFT\n",
       "AAPL  0.000277  0.000107  0.000078  0.000095\n",
       "GOOG  0.000107  0.000251  0.000078  0.000108\n",
       "IBM   0.000078  0.000078  0.000146  0.000089\n",
       "MSFT  0.000095  0.000108  0.000089  0.000215"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns.cov()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用Dataframe的corrwith方法，我们可以计算dataframe中不同columns之间，或row之间的相似性。传递一个series："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AAPL    0.386817\n",
       "GOOG    0.405099\n",
       "IBM     1.000000\n",
       "MSFT    0.499764\n",
       "dtype: float64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns.corrwith(returns.IBM)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "传入一个dataframe能计算匹配的column names质监局的相似性。这里我计算vooumn中百分比变化的相似性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AAPL   -0.075565\n",
       "GOOG   -0.007067\n",
       "IBM    -0.204849\n",
       "MSFT   -0.092950\n",
       "dtype: float64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "returns.corrwith(volume)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "传入axis='columns'能做到row-by-row计算。在correlation被计算之前，所有的数据会根据label先对齐。\n",
    "\n",
    "# 2 Unique Values, Value Counts, and Membership（唯一值，值计数，会员）\n",
    "\n",
    "这里介绍另一种从一维series中提取信息的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第一个函数时unique，能告诉我们series里unique values有哪些："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['c', 'a', 'd', 'b'], dtype=object)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "uniques = obj.unique()\n",
    "uniques"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回的unique values不是有序的，但我们可以排序，uniques.sort()。相对的，value_counts能计算series中值出现的频率："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    3\n",
       "c    3\n",
       "b    2\n",
       "d    1\n",
       "dtype: int64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回的结果是按降序处理的。vaule_counts也是pandas中的方法，能用在任何array或sequence上："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    1\n",
       "c    3\n",
       "b    2\n",
       "a    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.value_counts(obj.values, sort=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "isin 能实现一个向量化的集合成员关系检查，能用于过滤数据集，检查一个子集，是否在series的values中，或在dataframe的column中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    c\n",
       "1    a\n",
       "2    d\n",
       "3    a\n",
       "4    a\n",
       "5    b\n",
       "6    b\n",
       "7    c\n",
       "8    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4    False\n",
       "5     True\n",
       "6     True\n",
       "7     True\n",
       "8     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mask = obj.isin(['b', 'c'])\n",
    "mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    c\n",
       "5    b\n",
       "6    b\n",
       "7    c\n",
       "8    c\n",
       "dtype: object"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与isin相对的另一个方法是Index.get_indexer，能返回一个index array，告诉我们有重复值的values(to_match)，在非重复的values(unique_vals)中对应的索引值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])\n",
    "unique_vals = pd.Series(['c', 'b', 'a'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 2, 1, 1, 0, 2])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Index(unique_vals).get_indexer(to_match)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unique, value counts, and set membership methods：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/0v120.png)\n",
    "\n",
    "在某些情况下，你可能想要计算一下dataframe中多个column的柱状图："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Qu1</th>\n",
       "      <th>Qu2</th>\n",
       "      <th>Qu3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Qu1  Qu2  Qu3\n",
       "0    1    2    1\n",
       "1    3    3    5\n",
       "2    4    1    2\n",
       "3    3    2    4\n",
       "4    4    3    4"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],\n",
    "                     'Qu2': [2, 3, 1, 2, 3],\n",
    "                     'Qu3': [1, 5, 2, 4, 4]})\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把padas.value_counts传递给dataframe的apply函数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Qu1</th>\n",
       "      <th>Qu2</th>\n",
       "      <th>Qu3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Qu1  Qu2  Qu3\n",
       "1  1.0  1.0  1.0\n",
       "2  NaN  2.0  1.0\n",
       "3  2.0  2.0  NaN\n",
       "4  2.0  NaN  2.0\n",
       "5  NaN  NaN  1.0"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = data.apply(pd.value_counts)\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "每一行的laebls(即1，2，3，4，5)其实就是整个data里出现过的值，从1到5。而对应的每个方框里的值，则是表示该值在当前列中出现的次数。比如，(2, Qu1)的值是Nan，说明2这个数字没有在Qu1这一列出现过。(2, Qu2)的值是2，说明2这个数字在Qu2这一列出现过2次。(2, Qu3)的值是1，说明2这个数字在Qu3这一列出现过1次。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
